Abstract. We make a connection between classical polytopes called
zonotopes and Support Vector Machine (SVM) classifiers. We combine
this connection with the ellipsoid method to give some new theoretical
results on training SVMs. We also describe some special properties of
C-SVMs for $C \to \infty$.

1 Introduction 

A statistical classifier algorithm maps a set of training vectors (positively and
negatively labeled points in $\mathbb{R}^d$) to a decision boundary. A Support Vector
Machine (SVM) is a classifier algorithm in which the decision boundary depends
on only a subset of training vectors, called the support vectors [18]. This limited
dependence on the training set helps give SVMs good generalizability, meaning
that SVMs are resistant to overtraining even in the case of large $d$. Another key
idea associated with SVMs is the use of a kernel function in computing the dot
product of two training vectors. For example, the usual dot product $v \cdot w$ could be
replaced by $k(v, w) = (v \cdot w)^2$ (quadratic kernel) or by $k(v, w) = \exp(-\|v - w\|^2)$
(radial basis function). The kernel function [14] in effect maps the original training
vectors in $\mathbb{R}^d$ into a higher-dimensional (perhaps infinite-dimensional) feature
space $\mathbb{R}^{d'}$; a linear decision boundary in $\mathbb{R}^{d'}$ then determines a nonlinear
decision surface back in $\mathbb{R}^d$. For good introductions to SVMs see the tutorial by
Burges or the book by Cristianini and Shawe-Taylor.

The basic maximum margin SVM applies to the case of linearly separable
training vectors, and divides positive and negative vectors by a farthest-apart
pair of parallel hyperplanes, as shown in Figure 1(a). The decision boundary
itself is typically the hyperplane halfway between the two parallel hyperplanes.
Computational geometers might expect that the extension of the SVM to the
non-separable case would divide positive and negative vectors by a
least-overlapping pair of half-spaces bounded by parallel hyperplanes, as shown
in Figure 1(b). This generalization, however, may be overly sensitive to outliers,
and hence the method of choice is a more robust soft margin classifier, called a
C-SVM [18] or $\nu$-SVM depending upon the precise formulation. Parameter $C$ is
a user-chosen penalty for errors.



Computing the maximum margin classifier for $n$ vectors in $\mathbb{R}^d$ amounts to
solving a quadratic program (QP) with about $d$ variables and $n$ linear
constraints. If the feature vectors are not explicit (that is, kernel functions are
being used), then the usual Lagrangian formulation gives a QP with about $n + d$
variables and linear constraints. Similarly, the soft margin classifier, with or
without explicit feature vectors, is computed in a Lagrangian formulation with
about $n + d$ variables and linear constraints. The jump from $d$ to $n + d$ variables
can have a great impact on the running time and choice of QP algorithm. Recent
results in computational geometry give fast QP algorithms for the case of
large $n$ and small $d$, algorithms requiring about $O(nd) + (\log n)\exp(O(\sqrt{d}))$
arithmetic operations. The best bound on the number of arithmetic operations
for a QP with $n + d$ variables and constraints is about $O((n + d)^3 L)$, where $L$
is the precision of the input data.

In this paper, we show that the jump from d to n + d is not necessary for 
soft margin classifiers with explicit feature vectors. More specifically, we describe 
training algorithms with running time near linear in n and polynomial in d and 
input precision, for two different scenarios: $C$ set by the user and $C \to \infty$. The
second scenario also introduces a natural measure of separability of point sets. 
Our algorithms build upon a geometric view of soft margin classifiers and 
the ellipsoid method for convex optimization. Due to their reliance on explicit 
feature vectors and the ellipsoid method, and also due to the fact that SVMs are 
more suited to the case of moderate n and large d than to the case of large n and 
small d, our algorithms have little practical importance. On the other hand, our 
results should be interesting theoretically. We view the soft margin classifier as 
a problem defined over a zonotope, a type of polytope that admits an especially 
compact description. Accordingly, our algorithms have lower complexity than 
either the vertex or facet descriptions of the polytopes. 




Fig. 1. (a) The maximum margin SVM classifier for the separable case. The
dashed line shows the decision boundary. (b) The most natural generalization
to the non-separable case is not popular.



2 SVM Formulations 



We adopt the usual SVM notation and mostly follow the presentation of Bennett
and Bredensteiner. The training vectors are $x_1, x_2, \ldots, x_n$, points in $\mathbb{R}^d$. The
corresponding labels are $y_1, y_2, \ldots, y_n$, each of which is either $+1$ or $-1$. Let
$I_+ = \{\, i \mid y_i = +1 \,\}$ and $I_- = \{\, i \mid y_i = -1 \,\}$. We use $w$ and $x$ to denote vectors
in $\mathbb{R}^d$ and $b$ to denote a scalar. We use the dot product notation $w \cdot x$, but in
this section $w \cdot x$ could be standing in for the kernel function $k(w, x)$.

In the maximum margin SVM we seek parallel hyperplanes defined by the
equations $w \cdot x = b_+$ and $w \cdot x = b_-$ such that $w \cdot x_i \le b_-$ for all $i \in I_-$ and
$w \cdot x_i \ge b_+$ for all $i \in I_+$. The signed distance between these two hyperplanes (the
margin) is $(b_+ - b_-)/\|w\|$ and hence can be maximized by minimizing $\|w\|^2 - (b_+ - b_-)$.

$$\min_{w,\,b_+,\,b_-} \; \|w\|^2 - (b_+ - b_-) \quad \text{subject to} \tag{1}$$

$$x_i \cdot w \ge b_+ \ \text{for } i \in I_+, \qquad x_i \cdot w \le b_- \ \text{for } i \in I_-.$$

A popular choice for the decision boundary is the plane halfway between the
parallel hyperplanes, $w \cdot x = (b_+ + b_-)/2$, and hence each unknown vector $x$ is
classified according to the sign of $w \cdot x - (b_+ + b_-)/2$.
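As a small illustration of the quantities just defined, the following Python sketch computes the margin $(b_+ - b_-)/\|w\|$ and classifies a vector by the sign of $w \cdot x - (b_+ + b_-)/2$. The function names `margin` and `classify` are ours, chosen for illustration, and sending ties to the positive class is an arbitrary assumption.

```python
import math

def margin(w, b_plus, b_minus):
    """Signed distance (b_plus - b_minus) / ||w|| between the two hyperplanes."""
    norm = math.sqrt(sum(c * c for c in w))
    return (b_plus - b_minus) / norm

def classify(w, b_plus, b_minus, x):
    """Label x by the sign of w.x - (b_plus + b_minus)/2.

    Ties (score exactly zero) are sent to +1; this is an arbitrary choice.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) - (b_plus + b_minus) / 2
    return 1 if score >= 0 else -1
```

For instance, with $w = (3, 4)$, $b_+ = 5$ and $b_- = -5$, the two hyperplanes are a distance $10/5 = 2$ apart.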

In the linearly separable case, we can set $b_+ = 1 - b$ and $b_- = -1 - b$ (thereby
rescaling $w$) and obtain the following optimization problem, the standard form
in most SVM treatments.

$$\min_{w,\,b} \; \|w\|^2 \quad \text{subject to} \tag{2}$$

$$x_i \cdot w + b \ge 1 \ \text{for } i \in I_+, \qquad x_i \cdot w + b \le -1 \ \text{for } i \in I_-.$$

Notice that this QP has $d + 1$ variables and $n$ linear constraints. At the solution,
$w$ is a linear combination of the $x_i$'s, $2/\|w\|$ gives the margin, and $w \cdot x + b = 0$ gives
the halfway decision boundary.

The dual problem to maximizing the distance between parallel hyperplanes 
separating the positive and negative convex hulls is to minimize the distance 
between points inside the convex hulls. Thus the dual in the separable case is 
the following. 



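$$\min_{\alpha} \; \Bigl\| \sum_{i \in I_+} \alpha_i x_i - \sum_{j \in I_-} \alpha_j x_j \Bigr\|^2 \quad \text{subject to} \quad 0 \le \alpha_i \le 1, \quad \sum_{i \in I_+} \alpha_i = 1, \quad \sum_{i \in I_-} \alpha_i = 1. \tag{3}$$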

Karush-Kuhn-Tucker (complementary slackness) conditions show that the
optimizing value of $w$ for the primal problem is given by the optimizing values of
$\alpha_i$ for (3): $w = \sum_{i \in I_+} \alpha_i x_i - \sum_{i \in I_-} \alpha_i x_i$. The vectors $x_i$ with $\alpha_i > 0$ are
called the support vectors.
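The relation between the dual variables and $w$ can be sketched in a few lines of Python. This is only an illustration under our own naming: `weight_vector` and its `eps` tolerance are hypothetical, and the dual values $\alpha_i$ are assumed to have been computed already by some QP solver.

```python
def weight_vector(points, labels, alphas, eps=0.0):
    """Return (w, support) where w = sum over I+ of a_i x_i minus sum over I- of a_i x_i.

    `support` lists the indices i with alphas[i] > eps, i.e. the support vectors.
    Labels are +1 or -1, so multiplying by the label applies the right sign.
    """
    d = len(points[0])
    w = [0.0] * d
    support = []
    for i, (x, y, a) in enumerate(zip(points, labels, alphas)):
        if a > eps:
            support.append(i)
        for k in range(d):
            w[k] += y * a * x[k]
    return w, support
```

With positive points $(2,0)$, $(4,0)$, negative points $(0,0)$, $(-2,0)$, and all dual weight on the nearest pair, the sketch returns $w = (2, 0)$ and reports those two points as the support vectors.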

The soft margin SVM adds slack variables to formulation (1), and then
penalizes solutions in proportion to the sum of these variables. Slack variable $\xi_i$
measures the error for training vector $x_i$, that is, how far $x_i$ lies on the wrong
side of the parallel hyperplane for $x_i$'s class.

$$\min_{w,\,b_+,\,b_-,\,\xi} \; \|w\|^2 - (b_+ - b_-) + \mu \sum_{i=1}^{n} \xi_i \quad \text{subject to} \tag{4}$$

$$\xi_i \ge 0 \ \ \forall i, \qquad x_i \cdot w \ge b_+ - \xi_i \ \text{for } i \in I_+, \qquad x_i \cdot w \le b_- + \xi_i \ \text{for } i \in I_-.$$
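The standard C-SVM formulation [18] again sets $b_+ = 1 - b$ and $b_- = -1 - b$.

$$\min_{w,\,b,\,\xi} \; \|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \tag{5}$$

$$\xi_i \ge 0 \ \ \forall i, \qquad x_i \cdot w + b \ge 1 - \xi_i \ \text{for } i \in I_+, \qquad x_i \cdot w + b \le -1 + \xi_i \ \text{for } i \in I_-.$$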

In formulation (5), the decision boundary is $w \cdot x + b = 0$. Formulation (4), however,
does not set the decision boundary, but only its direction. Crisp and Burges
write that because "originally the sum of $\xi_i$'s term arose in an attempt to
approximate the number of errors", the best option might be to run a "simple line
search" to find the decision boundary that actually minimizes the number of
training set errors.
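One way to read the "simple line search" is as a sweep over thresholds along the fixed direction $w$. The Python sketch below is our own interpretation, not a procedure spelled out in the text: it sorts the projections $w \cdot x_i$, tries one threshold between each pair of consecutive distinct projections, and keeps the first one with the fewest training errors. The name `best_threshold` and the convention of predicting $+1$ when $w \cdot x > t$ are assumptions, and the training set is assumed to be nonempty.

```python
def best_threshold(w, points, labels):
    """Return (t, errors) minimizing training errors for the rule: +1 iff w.x > t."""
    proj = sorted(
        (sum(wi * xi for wi, xi in zip(w, x)), y) for x, y in zip(points, labels)
    )
    n = len(proj)
    # Threshold below every point: all predicted +1, so each negative is an error.
    errors = sum(1 for _, y in proj if y == -1)
    best_t, best_err = proj[0][0] - 1.0, errors
    i = 0
    while i < n:
        j = i
        # Points with equal projection cross the threshold together.
        while j < n and proj[j][0] == proj[i][0]:
            # A positive point now predicted -1 is a new error; a negative one is fixed.
            errors += 1 if proj[j][1] == 1 else -1
            j += 1
        # Place the threshold midway to the next projection (or above all points).
        t = (proj[j - 1][0] + proj[j][0]) / 2 if j < n else proj[j - 1][0] + 1.0
        if errors < best_err:
            best_t, best_err = t, errors
        i = j
    return best_t, best_err
```

The sweep costs $O(n \log n)$ for the sort plus $O(nd)$ for the projections.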

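The dual of formulation (4) in the separable case minimizes the distance
between points inside "reduced" or "soft" convex hulls.

$$\min_{\alpha} \; \Bigl\| \sum_{i \in I_+} \alpha_i x_i - \sum_{i \in I_-} \alpha_i x_i \Bigr\|^2 \quad \text{subject to} \quad 0 \le \alpha_i \le \mu, \quad \sum_{i \in I_+} \alpha_i = 1, \quad \sum_{i \in I_-} \alpha_i = 1. \tag{6}$$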

See Figure 2. The reduced convex hull of points $x_i$, $i \in I_+$, is the set of convex
combinations $\sum_i \alpha_i x_i$ with each $\alpha_i \le \mu$. (Notice that in (6) there is no reason
to consider $\mu > 1$.) We shall say more about reduced convex hulls in the next
section.



Fig. 2. (a) Soft margin SVMs maximize the margin between reduced convex
hulls. (b) Although the soft margin is often explained as a way to handle
non-separability, it can help in the separable case as well.



The dual view highlights a slight difference between formulations (4) and (5).
Formulation (4) allows the direct setting of the reduced convex hulls. Parameter
$\mu$ limits the influence of any single training point; if the user expects no more than
four outliers in the training set, then an appropriate choice of $\mu$ might be 1/9 in
order to ensure that the majority of the support vectors are non-outliers. If the
reduced convex hulls intersect, the solution to (4) is the least-overlapping pair of
half-spaces, as in Figure 1(b). Formulation (5) is also always feasible (unlike the
standard hard margin formulation (2)), but it never allows the reduced convex
hulls to intersect. As $C \to \infty$ the reduced convex hulls either fill out their convex
hulls (the separable case) or continue growing until they asymptotically touch
(the non-separable case).
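The arithmetic behind the choice $\mu = 1/9$ can be made explicit. Since the weights in each class sum to 1 and none exceeds $\mu$, at least $\lceil 1/\mu \rceil$ of them must be positive; with $\mu = 1/9$ each class has at least nine support vectors, so at most four outliers leave at least five non-outliers. The sketch below encodes this reading, which is our interpretation of the remark; the function names are hypothetical.

```python
import math
from fractions import Fraction

def min_support_vectors_per_class(mu):
    """Weights sum to 1 and each is at most mu, so at least ceil(1/mu) are positive."""
    return math.ceil(1 / Fraction(mu))

def mu_for_outliers(k):
    """Largest mu of the form 1/m forcing a strict majority of non-outliers
    among one class's support vectors when at most k of them are outliers."""
    return Fraction(1, 2 * k + 1)
```

Exact rational arithmetic is used so that values such as $1/9$ are not perturbed by floating-point rounding.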



3 Reduced Convex Hulls and Zonotopes 

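Assume $0 \le \mu \le 1$ and define the positive and negative reduced convex hulls by

$$H_{+\mu} = \Bigl\{ \sum_{i \in I_+} \alpha_i x_i \;\Bigm|\; \sum_{i \in I_+} \alpha_i = 1, \ 0 \le \alpha_i \le \mu \Bigr\}, \qquad H_{-\mu} = \Bigl\{ \sum_{i \in I_-} \alpha_i x_i \;\Bigm|\; \sum_{i \in I_-} \alpha_i = 1, \ 0 \le \alpha_i \le \mu \Bigr\}.$$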

Figure 3 shows the reduced convex hull of three points $x_1$, $x_2$, and $x_3$ for various
values of $\mu$. The reduced convex hull grows from the centroid at $\mu = 1/3$ to the
convex hull at $\mu = 1$; for $\mu < 1/3$ it is empty. In Figure 2, $\mu$ is a little less than
1/2.

A reduced convex hull is a special case of a centroid polytope, the locus of
possible weighted averages of points each with an unknown weight within a
certain range [2]. For reduced convex hulls, each weight has the same range
$[0, \mu]$ and the sum of the weights is constrained to be 1. In [2] we related centroid
polytopes in $\mathbb{R}^d$ to special polytopes, called zonotopes, in $\mathbb{R}^{d+1}$. We repeat the
connection here, specialized to the case of reduced convex hulls.

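Let $v_i$ denote $(x_i, 1)$, the vector in $\mathbb{R}^{d+1}$ that agrees with $x_i$ on its first $d$
coordinates and has 1 as its last coordinate. Define

$$Z_{+\mu} = \Bigl\{ \sum_{i \in I_+} \alpha_i v_i \;\Bigm|\; 0 \le \alpha_i \le \mu \Bigr\}.$$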


Polytope $Z_{+\mu}$ is a Minkowski sum$^1$ of line segments of the form $S_i = \{\, \alpha_i v_i \mid 0 \le
\alpha_i \le \mu \,\}$. The Minkowski sum of line segments is a special type of convex
polytope called a zonotope. Polytope $H_{+\mu}$ is the cross-section of $Z_{+\mu}$ with the
$(d + 1)$-st coordinate (which by construction is also $\sum_i \alpha_i$) equal to one. Of
course, $H_{-\mu}$ can also be related to a zonotope in the same way. The following
lemmas state the property of zonotopes and reduced convex hulls that underlies
our algorithms. Lemma 2 is implicit in Keerthi et al.'s iterative nearest-point
approach to SVM training.

$^1$ The Minkowski sum of sets $A$ and $B$ in $\mathbb{R}^{d+1}$ is $\{\, p + q \mid p \in A \text{ and } q \in B \,\}$.



Fig. 3. The reduced convex hull of 3 points ranges from the centroid to the 
convex hull. 

Lemma 1. Let $Z$ be a zonotope that is the Minkowski sum of $n$ line segments
in $\mathbb{R}^{d+1}$. There is an algorithm with $O(nd)$ arithmetic operations for optimizing a
linear function over $Z$.

Proof. Assume that we are trying to find a vertex $v$ in zonotope $Z$ extreme in
direction $w$, that is, that maximizes the dot product $w \cdot v$. Assume that $Z$ is
the Minkowski sum of line segments of the form $S_i = \{\, \alpha_i v_i \mid 0 \le \alpha_i \le \mu \,\}$,
where $v_i \in \mathbb{R}^{d+1}$. We simply set each $\alpha_i$ independently to 0 or $\mu$, depending
upon whether the projection of $v_i$ onto $w$ is negative or positive. □
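The argument in this proof translates directly into code. The following Python sketch is a minimal rendering under our own naming (`zonotope_extreme_point` is hypothetical): each generator is handled independently with one dot product, for $O(nd)$ arithmetic operations in total. Generators orthogonal to $w$ are given weight 0, which is one of several equally good choices.

```python
def zonotope_extreme_point(generators, mu, w):
    """Maximize w.z over Z = { sum_i a_i v_i : 0 <= a_i <= mu }.

    Returns (point, alphas). Each generator v_i gets weight mu when w.v_i > 0
    and weight 0 otherwise, exactly as in the proof of Lemma 1.
    """
    dim = len(w)
    point = [0.0] * dim
    alphas = []
    for v in generators:
        dot = sum(wi * vi for wi, vi in zip(w, v))
        a = mu if dot > 0 else 0.0
        alphas.append(a)
        for k in range(dim):
            point[k] += a * v[k]
    return point, alphas
```

For the generators $(1,0)$, $(0,1)$, $(-1,1)$ with $\mu = 1$, direction $w = (1,1)$ selects the first two generators and yields the point $(1,1)$.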

Lemma 2. There is an algorithm with $O(nd)$ arithmetic operations for
optimizing a linear function over a reduced convex hull of $n$ points in $\mathbb{R}^d$.

Proof. Assume that we are trying to find a vertex $x$ in polytope $H_{+\mu}$ extreme
in direction $w$. Order the $x_i$'s with $y_i = +1$ according to their projection onto
vector $w$, breaking ties arbitrarily. In decreasing order by projection along $w$,
set the corresponding $\alpha_i$'s to $\mu$ until doing so would violate the constraint that
$\sum_i \alpha_i = 1$. Set $\alpha_i$ for this "transitional" vector to the maximum value allowed
by this constraint, and finally set the remaining $\alpha_i$'s to 0. Then $x = \sum_i \alpha_i x_i$
maximizes $w \cdot x$. □
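The greedy assignment in this proof can be sketched as follows. The function name `reduced_hull_extreme_point` and the small tolerance in the emptiness test are our own assumptions. For brevity the sketch sorts the projections, which costs $O(n \log n)$; matching the $O(nd)$ bound of the lemma would presumably require replacing the sort with linear-time selection of the transitional vector.

```python
def reduced_hull_extreme_point(points, mu, w):
    """Maximize w.x over { sum_i a_i x_i : sum_i a_i = 1, 0 <= a_i <= mu }.

    Returns (x, alphas). Raises ValueError if the reduced hull is empty.
    """
    n = len(points)
    if mu * n < 1 - 1e-12:  # tolerance guards against float rounding
        raise ValueError("reduced convex hull is empty: mu < 1/n")
    # Indices in decreasing order of projection onto w.
    order = sorted(
        range(n),
        key=lambda i: sum(wi * xi for wi, xi in zip(w, points[i])),
        reverse=True,
    )
    alphas = [0] * n
    remaining = 1
    for i in order:
        # Full weight mu until the budget runs out; the transitional
        # vector receives whatever weight is left.
        alphas[i] = min(mu, remaining)
        remaining -= alphas[i]
        if remaining <= 0:
            break
    x = [sum(alphas[i] * points[i][k] for i in range(n)) for k in range(len(w))]
    return x, alphas
```

Passing `mu` as a `fractions.Fraction` keeps the weights exact; with floats the transitional weight may carry a small rounding error.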

An interesting combinatorial question asks for the worst-case complexity of
a reduced convex hull $H_{+\mu}$. The vertex $x$ of $H_{+\mu}$ that is extreme for direction
$w$ can be associated with the set of $x_i$'s for which $\alpha_i > 0$. If $\mu = 1/k$, then as
in Lemma 2, $x$'s set is the first $k$ points in direction $w$, a set of $k$ points that
can be separated from the other $n - k$ points by a hyperplane normal to $w$. And
conversely, each separable set of $k$ points defines a unique vertex of $H_{+\mu}$. Hence
the maximum number of vertices of $H_{+\mu}$ is equal to the maximum number of
$k$-sets for $n$ points in $\mathbb{R}^d$, which is known to be $\omega(n^{d-1})$ and $o(n^d)$ [17]. In [2]
we showed that a more general centroid polytope in which each point $x_i$ has $\alpha_i$
between 0 and $\mu_i$ (that is, different weight bounds for different points) may have
complexity $\Theta(n^d)$.

We can also apply the argument in the proof of Lemma 2 to say something
about the optimizing values of the variables in (4) and (6) for the non-separable
case. (Alternatively we can derive the same statements from the
Karush-Kuhn-Tucker conditions.) Each of $H_{+\mu}$ and $H_{-\mu}$ has a transition in the sorted order of
the $x_i$'s when projected along the normal $w$ to the parallel pair of hyperplanes.
For $x_i$ with $i \in I_+$, $\alpha_i = 0$ if $x_i \cdot w$ lies on the "right" side of the transition,
$0 \le \alpha_i \le \mu$ if $x_i \cdot w$ coincides with the transition, and $\alpha_i = \mu$ if $x_i$ lies on the
"wrong" side of the transition. Of course an analogous statement holds for $x_i$ for
$i \in I_-$. As usual, the support vectors are those $x_i$ with $\alpha_i > 0$. Thus all training
set errors are support vectors. In Figure 2(a) there are six support vectors: two
transitional unfilled dots (marked $x_i$ and $x_j$) and one wrong-side unfilled dot,
along with one transitional and two wrong-side filled dots.



4 Ellipsoid-Based Algorithms 

We first assume that $\mu$ has been fixed in advance, perhaps using some knowledge
of the expected number of outliers or the desired number of support vectors. We
give an algorithm for solving formulation (4).

One approach would be to compute the vertices of $H_{+\mu}$ and $H_{-\mu}$ and then
use a hard margin formulation such as (2) with positive and negative training
vectors replaced by the vertices of $H_{+\mu}$ and $H_{-\mu}$ respectively. However, the
number of vertices of $H_{+\mu}$ and $H_{-\mu}$ may be very large, so this algorithm could
be very slow.

So instead we exploit a polynomial-time equivalence between separation and
optimization (see for example [15], chapter 14.2). The input to the separation
problem is a point $q$ and a polytope $P$ (typically given by a system of linear
inequalities). The output is either a statement that $q$ is inside $P$ or a hyperplane
separating $q$ and $P$. The input to the optimization problem is a direction $w$ and
a polytope $P$. The output is either a statement that $P$ is empty, a statement
that $P$ is unbounded in direction $w$, or a point in $P$ extreme for direction $w$. The
two problems are related by projective duality,$^2$ and a subroutine for solving one
can be used to solve the other in a number of calls that is polynomial in the
dimension $d$ and the input precision, that is, the number of bits in $q$ or $w$ plus
the maximum number of bits in an inequality defining $P$.

In our case, the polytope is not given by inequalities, but rather as a Minkowski
sum of line segments; this presentation has an impact on the required precision.
If the input precision is $L$, the maximum number of bits in one of the feature
vectors $x_i$, then the maximum number of bits in a vertex of the polytope is
$O(d^2 L \log n)$. What is new is the $O(\log n)$ term, resulting from the fact that a
vertex of a zonotope is a sum of up to $n$ input vectors.

$^2$ The more famous direction of this equivalence is that separation (which can be
solved directly by checking each inequality) implies optimization. This result is a
corollary of Khachiyan's ellipsoid method.



Theorem 1. Given $n$ explicit feature vectors in $\mathbb{R}^d$ and $\mu$ with $1/n \le \mu \le 1$,
there is a polynomial-time algorithm for computing a soft margin classifier, with
the number of arithmetic operations linear in $n$ and polynomial in $d$, $L$, and
$\log n$.

Proof. Consider the polytope $P$ that is the Minkowski sum of $H_{+\mu}$
and $-H_{-\mu}$, that is, $P = \{\, v_+ - v_- \mid v_+ \in H_{+\mu} \text{ and } v_- \in H_{-\mu} \,\}$. We are trying
to minimize over $P$ the convex quadratic objective function $\|v\|^2$, that is, the
squared length of a line segment between $H_{+\mu}$ and $H_{-\mu}$.

For a given direction $w$, we can find the solution $v = v_+ - v_-$ to the linear
optimization problem for $P$ by using Lemma 2 to find the $v_+$ optimizing $w$ over
$H_{+\mu}$ and the $v_-$ optimizing $-w$ over $H_{-\mu}$. Now given a point $q \in \mathbb{R}^d$, we can use
this observation and the polynomial-time equivalence between separation and
optimization to solve the separation problem for $q$ and $P$ in time linear in $n$ and
polynomial in $d$ and $L$. We can use this solution to the separation problem for $P$
as a subroutine for the ellipsoid method (see [10, 19]) in order to optimize $\|v\|^2$
over $P$. Given an optimizing choice of $v = v_+ - v_-$, it is easy to find the best
pair of parallel hyperplanes and a decision boundary, either the C-SVM decision
boundary or some other reasonable choice within the parallel family. □
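The linear optimization step used in this proof can be sketched on its own. The Python below covers only the oracle for $P = H_{+\mu} + (-H_{-\mu})$, not the separation-optimization equivalence or the ellipsoid method built on top of it. The function names are hypothetical, both classes are assumed to have at least $1/\mu$ points, and the greedy helper uses a sort for brevity.

```python
def extreme_in_reduced_hull(points, mu, w):
    """Greedy of Lemma 2: a point of the reduced hull maximizing w.x."""
    n = len(points)
    order = sorted(
        range(n),
        key=lambda i: sum(wi * xi for wi, xi in zip(w, points[i])),
        reverse=True,
    )
    alphas = [0] * n
    remaining = 1
    for i in order:
        alphas[i] = min(mu, remaining)
        remaining -= alphas[i]
        if remaining <= 0:
            break
    return [sum(alphas[i] * points[i][k] for i in range(n)) for k in range(len(w))]

def extreme_in_difference(pos, neg, mu, w):
    """Maximize w.v over P = { v_plus - v_minus }.

    v_plus maximizes w over the positive reduced hull; v_minus maximizes -w
    (that is, minimizes w) over the negative reduced hull.
    """
    v_plus = extreme_in_reduced_hull(pos, mu, w)
    v_minus = extreme_in_reduced_hull(neg, mu, [-c for c in w])
    return [a - b for a, b in zip(v_plus, v_minus)]
```

Because the objective separates into a term over $H_{+\mu}$ and a term over $H_{-\mu}$, the two greedy calls are independent and the cost remains linear in $n$ apart from the sort.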

Now assume that we are in the non-separable case. We shall show how to solve
for the maximum $\mu$ for which the reduced convex hulls have non-intersecting
interior, that is, the $\mu$ for which the margin is 0. This choice of $\mu$ corresponds
to $C \to \infty$ and the objective function simplifying to $\sum_i \xi_i$ in formulation (5).

This choice of $\mu$ has two special properties. First, among all settings of $C$,
$C \to \infty$ tends to give the fewest support vectors. To see this, imagine shrinking
the shaded regions in Figure 2(a). Support vectors are added each time one of
the parallel hyperplanes crosses a training vector. On the other hand, a support
vector may be lost occasionally when the number of reduced convex hull
vertices on the parallel hyperplanes changes, for example, if the vertex supporting
the upper parallel line in Figure 2(a) slipped off to the right of the segment
supporting the lower parallel line.

Second, the $\mu$ for which the margin is zero gives a natural measure of the
separability of two point sets. For simplicity, let $|I_+| = |I_-| = n/2$ and normalize
the zero-margin $\mu$ by $\mu^* = (\mu - 2/n)/(1 - 2/n)$. The separability measure $\mu^*$
runs from 0 to 1, with 0 meaning that the centroids coincide and 1 meaning
that the convex hulls have disjoint interiors. Computing the zero-margin $\mu$ as
the maximum value of a dual variable $\alpha_i$ using formulation (5) above is no
harder than training a C-SVM, and in the case of explicit features, it should be
significantly easier, as we now show.
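The normalization is a one-line affine rescaling of the interval $[2/n, 1]$ onto $[0, 1]$. A minimal sketch follows, assuming $|I_+| = |I_-| = n/2$ as in the text and $n > 2$ so that the denominator is nonzero; the function name `separability` is ours.

```python
from fractions import Fraction

def separability(mu_zero, n):
    """Normalize the zero-margin mu to mu* = (mu - 2/n) / (1 - 2/n).

    Assumes n > 2 training vectors split evenly between the two classes,
    so that the smallest feasible mu is 2/n.
    """
    lo = Fraction(2, n)
    return (Fraction(mu_zero) - lo) / (1 - lo)
```

With $n = 10$, the smallest feasible value $\mu = 1/5$ maps to $\mu^* = 0$ and $\mu = 1$ maps to $\mu^* = 1$.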

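We can formulate the problem as minimizing $\mu$ subject to

$$\sum_{i \in I_+} \beta_i x_i = \sum_{i \in I_-} \beta_i x_i, \qquad \sum_{i \in I_+} \beta_i = 1, \qquad \sum_{i \in I_-} \beta_i = 1, \qquad 0 \le \beta_i \le \mu.$$

As above, let $v_i$ denote $(x_i, 1)$, the vector in $\mathbb{R}^{d+1}$ that agrees with $x_i$ on its first
$d$ coordinates and has 1 as its last coordinate. Letting $\alpha_i = \beta_i/\mu$, we can rewrite
the problem as maximizing

$$\sum_{i \in I_+} \alpha_i = \sum_{i \in I_-} \alpha_i$$

subject to

$$\sum_{i \in I_+} \alpha_i v_i = \sum_{i \in I_-} \alpha_i v_i, \qquad 0 \le \alpha_i \le 1. \tag{7}$$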

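Yet another way to state the problem is to ask for the point with maximum
$(d + 1)$-st coordinate in $Z_+ \cap Z_-$, where

$$Z_+ = \Bigl\{ \sum_{i \in I_+} \alpha_i v_i \;\Bigm|\; 0 \le \alpha_i \le 1 \Bigr\}, \qquad Z_- = \Bigl\{ \sum_{i \in I_-} \alpha_i v_i \;\Bigm|\; 0 \le \alpha_i \le 1 \Bigr\}.$$

Polytopes $Z_+$ and $Z_-$ are each zonotopes, Minkowski sums of line segments of
the form $S_i = \{\, \alpha_i v_i \mid 0 \le \alpha_i \le 1 \,\}$.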
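To make the zero-margin $\mu$ concrete, the sketch below computes it for one-dimensional data, where each reduced convex hull is an interval whose endpoints come from the greedy of Lemma 2. It uses plain bisection on $\mu$, relying on the fact that the reduced hulls only grow as $\mu$ increases; this is a toy substitute for illustration and not the ellipsoid-based algorithm developed in this section. The function names and the tolerance `tol` are our own assumptions.

```python
def interval(values, mu):
    """Return (min, max) of the 1-D reduced convex hull with weight cap mu."""
    def top(vals):
        # Greedy of Lemma 2 in one dimension: weight mu on the largest values.
        remaining, total = 1.0, 0.0
        for v in sorted(vals, reverse=True):
            a = min(mu, remaining)
            total += a * v
            remaining -= a
            if remaining <= 0:
                break
        return total
    return -top([-v for v in values]), top(values)

def zero_margin_mu(pos, neg, tol=1e-9):
    """Smallest mu at which the two 1-D reduced hulls intersect.

    Returns None when even the full convex hulls (mu = 1) are disjoint,
    that is, when the data are separable.
    """
    def intersect(mu):
        a_lo, a_hi = interval(pos, mu)
        b_lo, b_hi = interval(neg, mu)
        return max(a_lo, b_lo) <= min(a_hi, b_hi)

    lo = max(1.0 / len(pos), 1.0 / len(neg))  # below this a hull is empty
    hi = 1.0
    if not intersect(hi):
        return None
    if intersect(lo):
        return lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if intersect(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

For positive values $\{0, 2\}$ and negative values $\{1, 3\}$, the upper end of the positive interval is $2\mu$ and the lower end of the negative interval is $3 - 2\mu$, so the hulls first touch at $\mu = 3/4$.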

Theorem 2. Let $Z_1$ and $Z_2$ be zonotopes defined by a total of $n$ line segments in
$\mathbb{R}^d$. There is an algorithm for optimizing a linear objective function over $Z_1 \cap Z_2$,
with the number of arithmetic operations linear in $n$ and polynomial in $d$, $L$, and
$\log n$.

Proof. Given a point $q$ and zonotope $Z_i$, $1 \le i \le 2$, we can use Lemma 1
and the polynomial-time equivalence between separation and optimization to
solve the separation problem for $q$ and $Z_i$ in time linear in $n$ and polynomial
in $d$, $L$, and $\log n$. We can solve the separation problem for the intersection of
zonotopes $Z_1 \cap Z_2$ simply by solving it separately for each zonotope. We now use
the equivalence between separation and optimization in the other direction to
conclude that we can also solve the optimization problem for an intersection of
zonotopes. □

The proof of the following result then follows from the ellipsoid method in
the same way as the proof of Theorem 1.

Corollary 1. Given $n$ explicit feature vectors in $\mathbb{R}^d$, there is a polynomial-time
algorithm for computing the maximum $\mu$ for which $H_{+\mu}$ and $H_{-\mu}$ are linearly
separable, with the number of arithmetic operations linear in $n$ and polynomial
in $d$, $L$, and $\log n$.

Theorem 1 and Corollary 1 can be extended to some cases of implicit feature
vectors. For example, the quadratic kernel $k(v, w) = (v \cdot w)^2$ for vectors $v =
(v_1, v_2)$ and $w = (w_1, w_2)$ in $\mathbb{R}^2$ is equivalent to an ordinary dot product in $\mathbb{R}^3$,
namely $k(v, w) = \Phi(v) \cdot \Phi(w)$, where $\Phi(v) = (v_1^2, \sqrt{2}\,v_1 v_2, v_2^2)$. In general, a
polynomial kernel $k(v, w) = (v \cdot w)^p$ amounts to lifting the training vectors from
$\mathbb{R}^d$ to $\mathbb{R}^{d'}$ where $d' = \binom{d+p-1}{p}$. Radial basis functions, however, give $d' = \infty$,
and the SVM training problem seems to necessarily involve $n + d$ variables. (The
rather amazing part is that it is a combinatorial optimization problem at all!)
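The quadratic-kernel identity is easy to check numerically. The sketch below uses our own function names: it evaluates both sides of $k(v, w) = \Phi(v) \cdot \Phi(w)$ for vectors in $\mathbb{R}^2$ and computes the number of degree-$p$ monomials in $d$ variables, $\binom{d+p-1}{p}$, which is the standard count for the lifted dimension of the homogeneous polynomial kernel.

```python
import math

def phi(v):
    """Explicit feature map for the quadratic kernel on R^2."""
    v1, v2 = v
    return (v1 * v1, math.sqrt(2) * v1 * v2, v2 * v2)

def quadratic_kernel(v, w):
    """k(v, w) = (v . w)^2 for vectors in R^2."""
    return (v[0] * w[0] + v[1] * w[1]) ** 2

def lifted_dimension(d, p):
    """Number of degree-p monomials in d variables (requires Python 3.8+)."""
    return math.comb(d + p - 1, p)
```

For $d = 2$ and $p = 2$ the count is 3, matching the lift to $\mathbb{R}^3$ above; for $d = 20$ and $p = 3$ it is already 1540, which suggests how quickly explicit feature vectors grow.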





5 Discussion and Conclusions 



In this paper we have connected SVMs to some recent results in computational 
geometry and mathematical programming. These connections raise some new 
questions, both practical and theoretical. 

Currently the best practical algorithms for training SVMs, Platt's sequential
minimal optimization (SMO) and Keerthi et al.'s nearest point algorithm
(NPA), can be viewed as interior-point methods that iteratively optimize the
margin over line segments. Both algorithms make use of heuristics to find line
segments close to the exterior, meaning line segments with $\alpha_i$ weights set to
either 0 or $C$.

Computational geometry may have a practical algorithm to contribute for the
case of $n$ large and $d$ small, say $n \approx 100{,}000$ and $d \approx 20$: the generalized linear
programming (GLP) paradigm of Matoušek et al. The training vectors
need not actually live in $\mathbb{R}^d$ for small $d$, so long as the GLP dimension of the
problem is small, where the GLP dimension is the number of support vectors in
any subproblem defined by a subset of the training vectors.

On the theoretical side, we are wondering about the existence of strongly
polynomial algorithms for QP problems over zonotopes. Due to the combinatorial
equivalence of zonotopes and arrangements, the graph diameter of a zonotope
is known to be only $O(n)$; polynomial graph diameter is of course a necessary
condition for the existence of a polynomial-time simplex-style algorithm.